Author

Whitney Hollman

#knitr kable() #can add a caption to the table using kable_styling() #gt is the best for styling.

Code
options(scipen = 999)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.8     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
Code
library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-8
Code
library(discrim)

Attaching package: 'discrim'

The following object is masked from 'package:dials':

    smoothness
Code
library(rpart)

Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune
Code
library(rpart.plot)
library(baguette)
library(kernlab)

Attaching package: 'kernlab'

The following object is masked from 'package:scales':

    alpha

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    alpha
Code
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Code
library(here)
here() starts at /Volumes/palang/STAT_551/Stat_Learning_wk1
Code
library(parsnip)
library(kknn)

Unemployment in the U.S.

Dataset from kaggle.com

Code
unemployment <- read.csv(here("Project_1", "Unemployment in America Per US State.csv"))

Context

What is the primary goal of this data analysis project?

Primary goal is unemployment rates used as a time series data set from 1976 to 2022 to predict homelessness in the United States, and the states with consistent higher rates of unemployment also have larger non institutional populations.

What are some secondary goals of this data analysis project?

The hope that the findings will help bring awareness of high or low employment rates, and what areas of the country are affected the most, etc.

Who is the target audience of this project? (Do not think of this as an academic paper. Instead, choose specific organizations or companies that you think might be interested in your results.)

State government labor offices, Companies with high unemployment rates.

How will the results of this project be used by the target audience?

If the data is able to predict future outcomes of the correlation between homelessness and unemployment, then a community outreach should be formed in these highly predictive areas in order to improve the lives of the persons who are destabilized economically. The roll of the government should be to create jobs that target the skills of the unemployed, etc.

Data Understanding & Previous Work

Who are the data providers?

Data was taken from the Bureau of Labor Statistics, and complied by Jason Oh the author of the data set on Kaggle.

What do their organizations do?

Government

Why did they create these datasets?

The dataset was created because the author believed that unemployment is significant issue in America. This issue contributes directly to the homelessness crisis. The author stated that by achieving quantitative, yet objective, viewpoints of a subject such as homelessness is crucial to be able to understand current issues in in America.

How was the data collected?

Data was compiled directly from Bureau of Labor Statistics by the author. The dataset tracks relevant population statistics and employment rates per US state, since 1976.

  1. The Bureau of Labor Statistics’s Economic News Release on (Monthly) State Employment and Unemployment - The Bureau of Labor Statistics has published monthly updates on unemployment rates since January 1976
  2. The Bureau of Labor Statistics’s State Employment and Unemployment Technical Note - The Bureau of Labor Statistics released a detailed overview of their unemployment data, the methodology behind their data, and the proper definitions and terminologies for the variables tracked. The guide mainly provided essential contextual knowledge needed to create a meaningful dataset

What are the cases, i.e., the rows of the datasets?

Rows are numerical.

What variables will you most likely use as your target variables? What do these variables measure, and how are they quantified?

State. Year.

What variables will you consider using as predictors? What do these variables measure, and how are they quantified?

Total unemployment by state. Total employment by state.

What data cleaning or manipulation might be necessary to prepare this data for modeling?

Renaming for ease of analysis. Quick plots, and linear regression, summary, counting, etc to get a better handle of the data.

Exploratory Analysis

Statistics Being Tracked

Column Names and Variables:

  • FIPS Code of State/Area(Federal Information Processing. Unique codes for states and counties, that are uniquely identified geographically).

  • Year/Month of Statistic

  • Total Civilian Non-Institutional Population in State/Area (All U.S. civilians not residing in institutional group quarters facilities such as correctional institutions, juvenile facilities, skilled nursing facilities, and other long-term care living arrangements. Are unemployed but looking for work)

  • Total Civilian Labor Force in State/Area

  • Percent (%) of State/Area’s Population

  • Total Employment in State/Area

  • Percent (%) of Labor Force Employed in State/Area

  • Total Unemployment in State/Area

  • Percent (%) of Labor Force Unemployed in State/Area

sd#### Data Cleaning

Code
unemployment <- read.csv(here("Project_1", "Unemployment in America Per US State.csv"))
Code
if(any(is.na(unemployment))) {
  pring("There are non-finite values in the data set.")
} else {
  print("There are no non-finite values in the data set.")}
[1] "There are no non-finite values in the data set."
Code
unemployment <- unemployment %>%
  clean_names()
Code
sapply(unemployment, class)
                                                fips_code 
                                                "integer" 
                                               state_area 
                                              "character" 
                                                     year 
                                                "integer" 
                                                    month 
                                                "integer" 
total_civilian_non_institutional_population_in_state_area 
                                              "character" 
                 total_civilian_labor_force_in_state_area 
                                              "character" 
                       percent_of_state_area_s_population 
                                                "numeric" 
                           total_employment_in_state_area 
                                              "character" 
            percent_of_labor_force_employed_in_state_area 
                                                "numeric" 
                         total_unemployment_in_state_area 
                                              "character" 
          percent_of_labor_force_unemployed_in_state_area 
                                                "numeric" 
Code
#| label: combine month and year columns into one date column
#clean_unemployment <- unemployment %>%
 # mutate(state_date = make_date(year, month)) %>%
#  select(-month, -year)

# I dont believe the month to be relevant for this analysis, and will be removing the month column.

clean_unemployment <- unemployment %>%
  select(-month) 

# Switch year to numeric for ggplots
clean_unemployment$year <- as.numeric(clean_unemployment$year)

I mutated the month and year into one column called state_date, that will essentially allow me to use time series analysis on the data set.

Code
# FIPS code above 52 is for counties remove New York City
# Left Los Angeles County in the data set because it is the most populous county in the United States, and the size of some entire state populations. Another reason for keeping it is because removing it increased noise in some plots.

clean_unemployment <- clean_unemployment %>%
  filter(fips_code != 51000)

clean_unemployment %>%
  select(state_area, fips_code) %>%
  distinct()
             state_area fips_code
1               Alabama         1
2                Alaska         2
3               Arizona         4
4              Arkansas         5
5            California         6
6    Los Angeles County        37
7              Colorado         8
8           Connecticut         9
9              Delaware        10
10 District of Columbia        11
11              Florida        12
12              Georgia        13
13               Hawaii        15
14                Idaho        16
15             Illinois        17
16              Indiana        18
17                 Iowa        19
18               Kansas        20
19             Kentucky        21
20            Louisiana        22
21                Maine        23
22             Maryland        24
23        Massachusetts        25
24             Michigan        26
25            Minnesota        27
26          Mississippi        28
27             Missouri        29
28              Montana        30
29             Nebraska        31
30               Nevada        32
31        New Hampshire        33
32           New Jersey        34
33           New Mexico        35
34             New York        36
35       North Carolina        37
36         North Dakota        38
37                 Ohio        39
38             Oklahoma        40
39               Oregon        41
40         Pennsylvania        42
41         Rhode Island        44
42       South Carolina        45
43         South Dakota        46
44            Tennessee        47
45                Texas        48
46                 Utah        49
47              Vermont        50
48             Virginia        51
49           Washington        53
50        West Virginia        54
51            Wisconsin        55
52              Wyoming        56

I removed the New York City FIPS code because it is not a state, and it is not a county. It is a city. The FIPS code for New York City is 51000. For example, the FIPS code for New York State is 36. The FIPS code for New York County is 36061. The FIPS code for Kings County is 36047. The FIPS code for Queens County is 36081. The FIPS code for Bronx County is 36005. The FIPS code for Richmond County is 36085.

Code
clean_unemployment[1:6,] %>%
  get_one_to_one()
[[1]]
[1] "fips_code"                                                
[2] "state_area"                                               
[3] "total_civilian_non_institutional_population_in_state_area"
[4] "total_civilian_labor_force_in_state_area"                 
[5] "percent_of_state_area_s_population"                       
[6] "total_employment_in_state_area"                           
[7] "percent_of_labor_force_employed_in_state_area"            
[8] "total_unemployment_in_state_area"                         
[9] "percent_of_labor_force_unemployed_in_state_area"          
Code
#Function for removing and converting all columns
remove_commas_and_convert_to_numeric <- function(x) {
 as.numeric(gsub(",", "", x))
}

clean_unemployment <- clean_unemployment %>%
  mutate(across(starts_with("total"), remove_commas_and_convert_to_numeric))
Code
clean_unemployment$region <- case_when(
   clean_unemployment$state_area %in% c("California", "Los Angeles County", "Oregon", "Washington", "Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Montana", "Wyoming", "Alaska", "Hawaii", "Utah") ~ "West",
  clean_unemployment$state_area %in% c("North Dakota", "South Dakota", "Nebraska", "Kansas", "Minnesota", "Iowa", "Missouri", "Wisconsin", "Michigan", "Illinois", "Indiana", "Ohio") ~ "Midwest",
  clean_unemployment$state_area %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", "Alabama", "Georgia", "Florida", "South Carolina", "North Carolina", "Virginia", "Tennessee", "Kentucky", "Delaware", "Maryland", "Washington D.C.", "West Virginia", "District of Columbia") ~ "South",
  clean_unemployment$state_area %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "Pennsylvania", "New York") ~ "Northeast",
  TRUE ~ "Other"
)
Code
clean_unemployment <- clean_unemployment %>%
  mutate(percentage_total_civilian_non_institutional_pop = round((total_civilian_non_institutional_population_in_state_area / sum(total_civilian_non_institutional_population_in_state_area)) * 100, 4)) 
Code
clean_unemployment <- clean_unemployment %>%
  relocate(region, .after = state_area)
Code
clean_unemployment <- clean_unemployment %>%
  arrange(state_area, year) %>%
  mutate(lagged_value = lag(total_unemployment_in_state_area),
         percentage_change = 
(total_unemployment_in_state_area - lagged_value)
      /lagged_value * 100) %>%
  drop_na(percentage_change)

clean_unemployment$percentage_change <- 
  round(clean_unemployment$percentage_change, 3)

Descriptive Statistics and Visualizations

Code
clean_unemployment %>%
  summary()
   fips_code      state_area           region               year     
 Min.   : 1.00   Length:29327       Length:29327       Min.   :1976  
 1st Qu.:17.00   Class :character   Class :character   1st Qu.:1987  
 Median :30.00   Mode  :character   Mode  :character   Median :1999  
 Mean   :29.12                                         Mean   :1999  
 3rd Qu.:41.50                                         3rd Qu.:2011  
 Max.   :56.00                                         Max.   :2022  
 total_civilian_non_institutional_population_in_state_area
 Min.   :  232000                                         
 1st Qu.: 1086000                                         
 Median : 2861000                                         
 Mean   : 4197551                                         
 3rd Qu.: 5133131                                         
 Max.   :31236439                                         
 total_civilian_labor_force_in_state_area percent_of_state_area_s_population
 Min.   :  160022                         Min.   :51.00                     
 1st Qu.:  705984                         1st Qu.:62.90                     
 Median : 1833930                         Median :66.00                     
 Mean   : 2718250                         Mean   :65.67                     
 3rd Qu.: 3364976                         3rd Qu.:68.50                     
 Max.   :19600700                         Max.   :75.70                     
 total_employment_in_state_area percent_of_labor_force_employed_in_state_area
 Min.   :  148718               Min.   :41.60                                
 1st Qu.:  668042               1st Qu.:58.80                                
 Median : 1716143               Median :62.00                                
 Mean   : 2550745               Mean   :61.83                                
 3rd Qu.: 3193095               3rd Qu.:65.10                                
 Max.   :18754316               Max.   :73.10                                
 total_unemployment_in_state_area
 Min.   :   4980                 
 1st Qu.:  36232                 
 Median : 101518                 
 Mean   : 167505                 
 3rd Qu.: 200385                 
 Max.   :3018611                 
 percent_of_labor_force_unemployed_in_state_area
 Min.   : 1.900                                 
 1st Qu.: 4.300                                 
 Median : 5.500                                 
 Mean   : 5.885                                 
 3rd Qu.: 7.100                                 
 Max.   :30.600                                 
 percentage_total_civilian_non_institutional_pop  lagged_value    
 Min.   :0.000200                                Min.   :   4980  
 1st Qu.:0.000900                                1st Qu.:  36235  
 Median :0.002300                                Median : 101518  
 Mean   :0.003411                                Mean   : 167508  
 3rd Qu.:0.004200                                3rd Qu.: 200385  
 Max.   :0.025400                                Max.   :3018611  
 percentage_change 
 Min.   : -95.190  
 1st Qu.:  -1.294  
 Median :  -0.304  
 Mean   :   0.826  
 3rd Qu.:   0.854  
 Max.   :4262.441  

Using summary statistics on four specific variables that I believe to be relevant to the analysis. This gives a brief overview on the similarities and differences between what I believe to be the most important variables.

Code
clean_unemployment %>%
  select(total_civilian_non_institutional_population_in_state_area, total_unemployment_in_state_area, percent_of_labor_force_unemployed_in_state_area,
         percent_of_state_area_s_population) %>%
  summary()
 total_civilian_non_institutional_population_in_state_area
 Min.   :  232000                                         
 1st Qu.: 1086000                                         
 Median : 2861000                                         
 Mean   : 4197551                                         
 3rd Qu.: 5133131                                         
 Max.   :31236439                                         
 total_unemployment_in_state_area
 Min.   :   4980                 
 1st Qu.:  36232                 
 Median : 101518                 
 Mean   : 167505                 
 3rd Qu.: 200385                 
 Max.   :3018611                 
 percent_of_labor_force_unemployed_in_state_area
 Min.   : 1.900                                 
 1st Qu.: 4.300                                 
 Median : 5.500                                 
 Mean   : 5.885                                 
 3rd Qu.: 7.100                                 
 Max.   :30.600                                 
 percent_of_state_area_s_population
 Min.   :51.00                     
 1st Qu.:62.90                     
 Median :66.00                     
 Mean   :65.67                     
 3rd Qu.:68.50                     
 Max.   :75.70                     
Subsets for High and Low Unemployment rates per Region
Code
# Average unemployment rate for each state

average_unemployment <- clean_unemployment %>%
  group_by(region) %>%
  summarize(avg_unemployment = mean(total_unemployment_in_state_area),
            avg_labor_force = mean(total_civilian_labor_force_in_state_area),
     yearly_avg_unemployment = mean(percent_of_labor_force_unemployed_in_state_area),
     avg_population = mean(total_civilian_labor_force_in_state_area))

# Top 5 lowest and highest average

top_5_highest <- average_unemployment %>%
  top_n(5, wt = avg_unemployment)

top_5_lowest <- average_unemployment %>%
  arrange(avg_unemployment) %>%
  slice(1:5)

# Subset data for top 5 highest and lowest
unemployment_high_unemployment <- clean_unemployment %>%
  filter(region %in% top_5_highest$region) %>%
  select(region, year, total_unemployment_in_state_area, total_civilian_labor_force_in_state_area) %>%
  arrange(desc(year))
  

unemployment_low_unemployment <- clean_unemployment %>%
  filter(region %in% top_5_lowest$region) 
Code
clean_unemployment %>%
  ggplot(mapping = aes(
    x = total_civilian_non_institutional_population_in_state_area,
    y = total_unemployment_in_state_area, color = total_civilian_non_institutional_population_in_state_area)) +
  geom_point() +
  geom_jitter() +
  geom_line() +
  labs(
    x = "Total Civilian Non-Insitutional Population in State Area",
    y = "Total Unemployment in State Area",
    title = "Total Unemployment in State Area vs. Total Civilian Non-Instituional Population in State Area",
    subtitle = "Line Graph",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Code
clean_unemployment %>%
  ggplot(mapping = aes(
    x = total_unemployment_in_state_area,
    y = state_area)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Total Unemployment in State Area",
    y = "State Area",
    title = "Total Unemployment in State Area vs. FIPS Code",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
set.seed(1234)
clean_unemployment %>%
  ggplot(mapping = aes(
    x = total_unemployment_in_state_area)) +
  geom_histogram(aes(y = ..density..), bins = 20, fill = "lightblue", color = "black") +
  geom_density(color = "blue") +
  labs(
    x = "Total Unemployment in State Area",
    y = "Density",
    title = "Total Unemployment in State Area",
    subtitle = "Histogram and Density Plot",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

Code
set.seed(1234)
average_unemployment %>%
  ggplot(mapping = aes(
    x = avg_unemployment)) +
  geom_histogram(aes(y = ..density..), bins = 20, fill = "lightblue", color = "black") +
  geom_density(color = "red") +
  labs(
    x = "Average Unemployment for Each State",
    y = "Density",
    title = "States Average Unemployment",
    subtitle = "Histogram and Density Plot of Current Unemployment",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Average unemployment seems to be decreasing over time for most of the states in the US. This is a good sign that the economy is improving. However, there are still some states that have a higher average unemployment rate than others. This could be due to a number of factors, such as the state’s economy, the state’s population, and the state’s unemployment rate.

Explore Relationship Between Unemployment and Non-Institutional Population per State Area
Code
clean_unemployment %>%
  group_by(region) %>%
  ggplot(aes(
    x = total_civilian_non_institutional_population_in_state_area,
    y = percent_of_labor_force_unemployed_in_state_area,
    color = region)) +
  geom_point(aes(year)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_jitter() +
  labs(
    x = "Total Civilian Non-Insitutional Population in State Area",
    y = "Percent of Labor Force Unemployed in State Area",
    title = "Percent of Labor Force Unemployed vs. Total Civilian Non-Instituional Population in State Area",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
clean_unemployment %>%
  group_by(region) %>%
  ggplot(aes(
    x = total_civilian_labor_force_in_state_area,
    y = total_unemployment_in_state_area,
    color = region)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_jitter() +
  labs(
    x = "Total Civilian Labor Force in State Area",
    y = "Total Unemployment in State Area",
    title = "Total Civilian Labor Force vs. Total Unemployment") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The more interesting relationship is between the total civilian non-institutional population in state area and the total unemployment in state area is that not only does the west have a higher probability of higher unemployment labor force, but also a higher non-institutional population. This could be due to the fact that the west has a higher population than the other regions.

Furthermore, the second plot shows that the West and South have more significant changes in the data. With the West having a more positive slope and the South having a more negative slope. The Midwest and Northeast have a more neutral slope. Which means that in the Western States there is a positive relationship between unemployment and non institutional population. In the South there is a negative relationship between unemployment and non institutional population. In the Midwest and Northeast there is no relationship between unemployment and non institutional population.

Highest and Lowest Unemployment Over Time (2000-2019) per Region
Code
# Unemployment Population
 clean_unemployment %>%
  filter(year >= 1976,
         region == c('West', 'South','Midwest','Northeast')) %>%
  ggplot(aes(
    x = year,
    y = total_unemployment_in_state_area,
    color = region)) +
  geom_point() +
 #geom_smooth(method = "lm", se = FALSE) +
  geom_jitter() +
  labs(
    x = "Year",
    y = "Total Unemployment in State Area",
    shape = "Year",
    title = "Total Unemployment in State Area vs. Year",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
Warning: There was 1 warning in `filter()`.
ℹ In argument: `region == c("West", "South", "Midwest", "Northeast")`.
Caused by warning in `region == c("West", "South", "Midwest", "Northeast")`:
! longer object length is not a multiple of shorter object length

Code
# Non-Institutional Population 
clean_unemployment %>%
  filter(year >= 2016,
         region == c('West', 'South','Midwest','Northeast')) %>% 
  ggplot(aes(
    y = total_civilian_non_institutional_population_in_state_area,
    x = year,
    color = region)) +
  geom_point() +
 #geom_smooth(method = "lm", se = FALSE) +
  geom_jitter() +
  labs(
    x = "Year",
    y = "Total Non_Institutional Population in State Area",
    shape = "Year",
    title = "Total Non-Institutional Population in State Area vs. Year",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal() 
Warning: There was 1 warning in `filter()`.
ℹ In argument: `region == c("West", "South", "Midwest", "Northeast")`.
Caused by warning in `region == c("West", "South", "Midwest", "Northeast")`:
! longer object length is not a multiple of shorter object length

Code
sub_high_unemployment <- unemployment_high_unemployment %>%
  group_by(region) %>%
  mutate(
    avg_labor_force = mean(
      total_civilian_labor_force_in_state_area),
    avg_unemployment = mean(total_unemployment_in_state_area))

sub_low_unemployment <- unemployment_low_unemployment %>%
  group_by(region) %>%
  mutate(
    avg_labor_force =
      mean(total_civilian_labor_force_in_state_area),
    avg_unemployment = mean(total_unemployment_in_state_area))

sub_high_unemployment %>%
  ggplot(aes(
    x = ,
    y = avg_unemployment,
    color = region)) +
  geom_boxplot() +
  labs(
    x = "Average Total Civilian Non-Insitutional Population in State Area",
    y = "Average Total Unemployment in State Area",
    title = "Average Total Unemployment in State Area vs. Average Total Civilian Non-Instituional Population in State Area",
    subtitle = "Linear Regression Model",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Variable Relationships

Code
unemployment_subset <- clean_unemployment %>%
  select(-state_area, -fips_code, -region, -year)

cor_matrix <- cor(unemployment_subset)

unemployment_subset %>% 
  cor() %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything()) %>% 
  filter(value != 1) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Histogram",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Code
 library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Code
#| label: pair plot visual

unemployment_sub <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percent_of_labor_force_employed_in_state_area,
         percent_of_labor_force_unemployed_in_state_area,
         percentage_change) %>%
  sample_n(300)

pair_plot <- unemployment_sub %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot

Code
unemployment_sub_2 <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percent_of_labor_force_employed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_2 <- unemployment_sub_2 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_2

Code
unemployment_sub_3 <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percent_of_labor_force_unemployed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_3 <- unemployment_sub_3 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_3

Code
unemployment_sub_4 <- clean_unemployment %>%
  select(percent_of_labor_force_unemployed_in_state_area,
         percentage_total_civilian_non_institutional_pop,
        ) %>%
  sample_n(300)

pair_plot_4 <- unemployment_sub_4 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_4

Code
#unemployment_sub_5 <- clean_unemployment %>%
  #select(percent_of_labor_force_employed_in_state_area,
         #percentage_total_civilian_non_institutional_pop,
       # ) %>%
  #sample_n(300)

#pair_plot_5 <- unemployment_sub_5 %>%
  #ggpairs(ggplot2::aes(fill = "purple")) +
  #labs(
    #x = "Correlation",
    #y = "Count",
    #title = "Correlation Distribution",
    #subtitle = "Scatterplot Matrix",
    #caption = "Data from the Bureau of Labor Statistics") +
  #theme_minimal()

#pair_plot_5

unemployment_sub_6 <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percentage_change,
        ) %>%
  sample_n(300)

pair_plot_6 <- unemployment_sub_6 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_6

Code
unemployment_sub_7 <- clean_unemployment %>%
  select(percent_of_labor_force_employed_in_state_area,
         percent_of_labor_force_unemployed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_7 <- unemployment_sub_7 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_7

Code
#Better correlation than the others
unemployment_sub_8 <- clean_unemployment %>%
  select(year,
         percent_of_labor_force_employed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_8 <- unemployment_sub_8 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_8

Code
# Correlation is negative but skewed left, like non-institutional population
unemployment_sub_9 <- clean_unemployment %>%
  select(year,
         percent_of_labor_force_unemployed_in_state_area,
        ) %>%
  sample_n(300)

pair_plot_9 <- unemployment_sub_9 %>%
  ggpairs(ggplot2::aes(fill = "purple")) +
  labs(
    x = "Correlation",
    y = "Count",
    title = "Correlation Distribution",
    subtitle = "Scatterplot Matrix",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

pair_plot_9

Predictor-Response Relationships

Code
unemployment_sub <- clean_unemployment %>%
  select(percent_of_state_area_s_population,
         percent_of_labor_force_employed_in_state_area,
         percent_of_labor_force_unemployed_in_state_area,
         total_unemployment_in_state_area,
         total_civilian_labor_force_in_state_area,
         percentage_total_civilian_non_institutional_pop
         ) %>%
  sample_n(300)

unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_state_area_s_population,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of State Area's Population",
    y = "Total Unemployment in State Area",
    title = "Total Unemployment in State Area vs. Percent of State Area's Population",
    subtitle = "Linear Regression Model 1",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_labor_force_employed_in_state_area,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of Labor Force Employed in State Area",
    y = "Total Unemployment in State Area",
    title = "Total Unemployment in State Area vs. Percent of Labor Force Employed in State Area",
    subtitle = "Linear Regression Model 2",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
unemployment_sub %>%
  ggplot(mapping = aes(
    x = percent_of_labor_force_unemployed_in_state_area,
    y = total_civilian_labor_force_in_state_area)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percent of Labor Force Unemployed in State Area",
    y = "Total Civilian Labor Force in State Area",
    title = "Total Civilian Labor Force in State Area vs. Percent of Labor Force Unemployed in State Area",
    subtitle = "Linear Regression Model 3",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
# Despite the correlation above in the pair plots, there does seem to be a positive relationship between the two variables. Unemployment and the Population in prison are both increasing over time.
unemployment_sub %>%
  ggplot(aes(
    x = percentage_total_civilian_non_institutional_pop,
    y = total_unemployment_in_state_area)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Percentage of Total Civilian Non-Institutional Population",
    y = "Total Unemployment in State Area",
    title = "Total Unemployment in State Area vs. Percentage of Total Civilian Non-Institutional Population",
    subtitle = "Linear Regression Model 4",
    caption = "Data from the Bureau of Labor Statistics") 
`geom_smooth()` using formula = 'y ~ x'

Model 4 shows a positive and significant relationship between non-institutional population and unemployment. This is a good sign for our model, as it shows that the data is not random and that there is a relationship between the variables. Policies that are related to the non-institutional population can be related to unemployment, and it is important to understand the size of both populations, in order for policymakers to assess the impact of their initiatives on a much broader scale. Both populations defined show the potential of the size of the labor force, if all were to be employed.

Top States with the Highest Unemployment Rate by Year and Comparison of Non-Institutional Population rates by the same Year
Code
top_5_states <- unemployment %>%
  group_by(year, state_area) %>%
  summarize(
    max_unemployment = max(percent_of_labor_force_unemployed_in_state_area),
    max_labor_force = max(percent_of_labor_force_employed_in_state_area),
  ) %>%
  arrange(desc(max_unemployment))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Code
print(top_5_states)
# A tibble: 2,491 × 4
# Groups:   year [47]
    year state_area         max_unemployment max_labor_force
   <int> <chr>                         <dbl>           <dbl>
 1  2020 Nevada                         30.6            61.7
 2  2020 Hawaii                         22.6            59.9
 3  2020 Michigan                       22.6            59.2
 4  2020 New York city                  21.4            57.8
 5  2020 Los Angeles County             18.8            61.9
 6  1983 West Virginia                  18.4            43.7
 7  2020 Illinois                       18              61.9
 8  2020 Rhode Island                   18              62.2
 9  1982 West Virginia                  17.9            47.3
10  2020 Massachusetts                  16.9            64.6
# ℹ 2,481 more rows
Code
#| label: percent of unemployed labor force and non-institutional population by year. 
#| 
mu_unemployment_sd <- clean_unemployment %>%
  group_by(year, region) %>%
  summarize(across(c(percent_of_labor_force_unemployed_in_state_area),
            list(mu = ~ mean(.), sigma = ~ sd(.)))) %>%
  arrange(desc(percent_of_labor_force_unemployed_in_state_area_mu))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
Code
print(mu_unemployment_sd)
# A tibble: 188 × 4
# Groups:   year [47]
    year region  percent_of_labor_force_unemployed_in_s…¹ percent_of_labor_for…²
   <dbl> <chr>                                      <dbl>                  <dbl>
 1  1983 South                                      10.0                    2.78
 2  1982 South                                       9.70                   2.33
 3  2010 West                                        9.40                   2.17
 4  2010 South                                       9.20                   1.46
 5  1982 Midwest                                     9.18                   3.19
 6  1983 West                                        9.15                   1.63
 7  1982 West                                        9.13                   1.81
 8  1983 Midwest                                     9.01                   3.04
 9  2009 South                                       8.96                   1.68
10  2011 West                                        8.82                   2.20
# ℹ 178 more rows
# ℹ abbreviated names: ¹​percent_of_labor_force_unemployed_in_state_area_mu,
#   ²​percent_of_labor_force_unemployed_in_state_area_sigma

Time Series Analysis of Percdentage Change in Unemployment Rate by Year

Code
ggplot(mu_unemployment_sd, aes(x = year, y = percent_of_labor_force_unemployed_in_state_area_mu)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = percent_of_labor_force_unemployed_in_state_area_mu - percent_of_labor_force_unemployed_in_state_area_sigma,
                    ymax = percent_of_labor_force_unemployed_in_state_area_mu + percent_of_labor_force_unemployed_in_state_area_sigma),
                width = 0.2) +
  facet_wrap(~region, ncol = 2) +
  labs(
    x = "Year",
    y = "Percent of Labor Force Unemployed By Region",
    title = "Percent of Labor Force Unemployed in Region By Year",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Top States With The Percentage Change in Unemployment Rate by Year
Code
increase_unemployment_year <- clean_unemployment %>%
  filter(percentage_change > 0) %>%
  select(year, state_area, percentage_change) %>%
  arrange(desc(percentage_change))

#label: ggplot showing increase of unemployment over time

ggplot(increase_unemployment_year, aes(x = year, y = percentage_change, color = state_area)) +
 geom_line()+
  labs(
    x = "Year",
    y = "Percentage Change",
    title = "Percentage Change in Unemployment Rate by Year",
    subtitle = "Top States with the Highest Percentage Change in Unemployment Rate",
    caption = "Data from the Bureau of Labor Statistics") +
  theme_minimal()

Brief Expected Conclusions and Models/Techniques to be Used

The goal of this project is to determine if there is a relationship between unemployment, the percentage of change in unemployment (whether negative or positive), and how these changes are represented nationally given certain years. I will be using linear regression to determine if there is a relationship between the two variables. I will also be using a time series analysis to determine if there is a relationship between the two variables over time. I will be using the data from the Bureau of Labor Statistics and the Bureau of Justice Statistics to determine the unemployment rate and the data from the Bureau of Justice Statistics to determine the prison population.

Preliminary Results

Model Selection

Model fit for chosen data sets, to get a ‘sense’ of the problem. Followed by what is believed to be the best model to refine and test original hypothesis, data analysis and exploration through plots.

Split and Test Various Data Sets
Code
set.seed(93422)

unemployment_split <- clean_unemployment %>% 
  initial_split(prop = 0.9)

unemployment_train <- unemployment_split %>% training()
unemployment_test <- unemployment_split %>% testing()

dim(clean_unemployment)
[1] 29327    14
Code
dim(unemployment_train)
[1] 26394    14
Code
dim(unemployment_test)
[1] 2933   14
Code
#|label: fit the models for training data

lr_mod <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")
Code
unemployment_lm1 <- lr_mod %>%
  fit(year ~ percentage_change, data = unemployment_train)

unemployment_lm2 <- lr_mod %>%
  fit(year ~ poly(percentage_change, 2), data = unemployment_train)

unemployment_test <- unemployment_test %>%
  mutate(
    pred_1 = predict(unemployment_lm1,
                     new_data= unemployment_test,
                     type = "raw"),
    pred_2 = predict(unemployment_lm2, 
                     new_data = unemployment_test,
                     type = "raw")
  )

unemployment_test %>%
  rmse(truth = percentage_change, estimate = pred_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1999.
Code
unemployment_test %>%
  rmse(truth = percentage_change, estimate = pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1999.
Code
unemployment_lm1 <- lr_mod %>%
  fit(year ~ percent_of_labor_force_unemployed_in_state_area, data = unemployment_train)

unemployment_lm2 <- lr_mod %>%
  fit(year ~ poly(percent_of_labor_force_unemployed_in_state_area, 2), data = unemployment_train)

unemployment_test %>%
  rmse(truth = percent_of_labor_force_unemployed_in_state_area, estimate = pred_1)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1993.
Code
unemployment_test %>%
  rmse(truth = percent_of_labor_force_unemployed_in_state_area, estimate = pred_2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1993.

Cross Validation and KNN Model

Code
set.seed(1234)
cvs <- vfold_cv(clean_unemployment, v = 5, strata = region)

cvs_recipe <- recipe(region ~ ., data = clean_unemployment) %>%
  step_rm(state_area) %>%
  step_normalize(all_predictors())

cvs_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Operations 
• Variables removed: state_area
• Centering and scaling for: all_predictors()
Code
knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(knn_mod)

knn_fit <- knn_wflow %>%
  fit_resamples(cvs)
Code
knn_fit <- knn_wflow %>%
  fit_resamples(cvs, 
                metrics = metric_set(roc_auc, accuracy, precision))

knn_fit %>%
  collect_metrics()
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config             
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy  multiclass 0.996     5 0.000292 Preprocessor1_Model1
2 precision macro      0.996     5 0.000301 Preprocessor1_Model1
3 roc_auc   hand_till  1.00      5 0.000135 Preprocessor1_Model1

I tried various folds to make sure the model was not overfitting. I also tried various neighbors to see which one would give me the best results. I found that the model was not overfitting and that the best number of neighbors was 5. I also found that the model did not overfit when I used the percentage change in unemployment rate and the percent of labor force unemployed in state area.

Use tidyModels textbook to check the prediction with original value of particular columns.

Predictive Models

Code
#|label: Decision Tree Model
set.seed(93422)

tree_mod <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(tree_mod)
Code
#|label: tree fit

tree_fit <- tree_wflow %>%
  fit_resamples(cvs,
            metrics = metric_set(accuracy, roc_auc, precision))

tree_fit %>%
  collect_metrics()
# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config             
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy  multiclass 0.871     5 0.00990 Preprocessor1_Model1
2 precision macro      0.880     5 0.00971 Preprocessor1_Model1
3 roc_auc   hand_till  0.968     5 0.00346 Preprocessor1_Model1
Code
#|label: Inspect the fit of the tree model

tree_fit_results <- tree_wflow %>%
  fit(clean_unemployment)

tree_fitted <- tree_fit_results %>%
  extract_fit_parsnip()

rpart.plot(tree_fitted$fit)
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
    Call rpart.plot with roundint=FALSE,
    or rebuild the rpart model with model=TRUE.

Code
tree_grid <- grid_regular(min_n(c(2, 20)),
                          levels = 4)

tree_grid
# A tibble: 4 × 1
  min_n
  <int>
1     2
2     8
3    14
4    20
Code
tree_mod <- decision_tree(min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(tree_mod)

tree_grid_search <- 
  tune_grid(
    tree_wflow,
    resamples = cvs,
    grid = tree_grid
  )

tuning_metrics <- tree_grid_search %>%
  collect_metrics()

tuning_metrics
# A tibble: 8 × 7
  min_n .metric  .estimator  mean     n std_err .config             
  <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1     2 accuracy multiclass 0.871     5 0.00990 Preprocessor1_Model1
2     2 roc_auc  hand_till  0.968     5 0.00346 Preprocessor1_Model1
3     8 accuracy multiclass 0.871     5 0.00990 Preprocessor1_Model2
4     8 roc_auc  hand_till  0.968     5 0.00346 Preprocessor1_Model2
5    14 accuracy multiclass 0.871     5 0.00990 Preprocessor1_Model3
6    14 roc_auc  hand_till  0.968     5 0.00346 Preprocessor1_Model3
7    20 accuracy multiclass 0.871     5 0.00990 Preprocessor1_Model4
8    20 roc_auc  hand_till  0.968     5 0.00346 Preprocessor1_Model4
Code
tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          min_n(),
                          levels = 2)
tree_grid
# A tibble: 8 × 3
  cost_complexity tree_depth min_n
            <dbl>      <int> <int>
1    0.0000000001          1     2
2    0.1                   1     2
3    0.0000000001         15     2
4    0.1                  15     2
5    0.0000000001          1    40
6    0.1                   1    40
7    0.0000000001         15    40
8    0.1                  15    40
Code
tree_mod <- decision_tree(cost_complexity = tune(),
                          tree_depth = tune(),
                          min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_recipe(cvs_recipe) %>%
  add_model(tree_mod)

tree_grid_search <-
  tune_grid(
    tree_wflow,
    resamples = cvs,
    grid = tree_grid
  )

tuning_metrics <- tree_grid_search %>%
  collect_metrics()

tuning_metrics
# A tibble: 16 × 9
   cost_complexity tree_depth min_n .metric  .estimator  mean     n   std_err
             <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>     <dbl>
 1    0.0000000001          1     2 accuracy multiclass 0.386     5 0.00146  
 2    0.0000000001          1     2 roc_auc  hand_till  0.588     5 0.00126  
 3    0.1                   1     2 accuracy multiclass 0.327     5 0.0000210
 4    0.1                   1     2 roc_auc  hand_till  0.5       5 0        
 5    0.0000000001         15     2 accuracy multiclass 0.998     5 0.000158 
 6    0.0000000001         15     2 roc_auc  hand_till  0.999     5 0.000103 
 7    0.1                  15     2 accuracy multiclass 0.355     5 0.0283   
 8    0.1                  15     2 roc_auc  hand_till  0.532     5 0.0322   
 9    0.0000000001          1    40 accuracy multiclass 0.386     5 0.00146  
10    0.0000000001          1    40 roc_auc  hand_till  0.588     5 0.00126  
11    0.1                   1    40 accuracy multiclass 0.327     5 0.0000210
12    0.1                   1    40 roc_auc  hand_till  0.5       5 0        
13    0.0000000001         15    40 accuracy multiclass 0.991     5 0.000643 
14    0.0000000001         15    40 roc_auc  hand_till  0.999     5 0.0000999
15    0.1                  15    40 accuracy multiclass 0.355     5 0.0283   
16    0.1                  15    40 roc_auc  hand_till  0.532     5 0.0322   
# ℹ 1 more variable: .config <chr>
Code
#| label: Tuning

tuning_metrics %>%
  filter(.metric == "accuracy") %>%
  slice_max(mean)
# A tibble: 1 × 9
  cost_complexity tree_depth min_n .metric  .estimator  mean     n  std_err
            <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>    <dbl>
1    0.0000000001         15     2 accuracy multiclass 0.998     5 0.000158
# ℹ 1 more variable: .config <chr>
Code
tuning_metrics %>%
  filter(.metric == "roc_auc") %>%
  slice_max(mean)
# A tibble: 1 × 9
  cost_complexity tree_depth min_n .metric .estimator  mean     n   std_err
            <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>     <dbl>
1    0.0000000001         15    40 roc_auc hand_till  0.999     5 0.0000999
# ℹ 1 more variable: .config <chr>
Code
rpart.plot(tree_fitted$fit,
           roundint = FALSE,
           cex = 0.3)

Final Conclusions of Analysis

References